Data Scrubbing - Identifying Duplicates And Assigning One Number
Feb 11, 2014
I am involved in a software conversion that is taking four full-time people over five weeks to clean up the data and assign a sequential alphanumeric number to each vendor and client. Each scrubber reviews an Excel spreadsheet containing the names, addresses, FID, telephone numbers, etc. of our vendors and customers. This information is pulled from two separate sources. We assign a BP # to the main office location and do not retire that record. Then we go on to identify the dups. All dups get a Y to be retired, but if a dup has a different address than the main one, we place a Y to bring that address over under that BP #.
Ultimately, we end up with two systems combined into one, dropping all the previously assigned numbers and giving each vendor, customer, etc. a new BP # that may have multiple addresses.
How can we assign an alphanumeric number without going through each individual line? There are over 900,000 of them. The key is to identify duplicate addresses and duplicate names. Some names might appear as RK Electrical or Robert King Electrical, but the address will usually be duplicated.
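One possible approach, sketched in Python rather than Excel, is to group rows on a normalized address and hand out one number per group. The `BP` prefix, the six-digit width, and the sample addresses below are assumptions for illustration only, and the sketch does not handle the "different address under the same BP #" case.

```python
import re

def normalize(text):
    """Uppercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", "", text.upper())
    return " ".join(cleaned.split())

def assign_bp_numbers(records, prefix="BP"):
    """records: list of (name, address) tuples in sheet order.

    Returns a list of (bp_number, retire_flag) in the same order. The first
    record seen for an address keeps a blank flag; later records with the
    same normalized address get the same number and a "Y".
    """
    bp_by_address = {}
    result = []
    for _name, address in records:
        key = normalize(address)
        if key not in bp_by_address:
            # Hypothetical numbering scheme: prefix + zero-padded sequence.
            bp_by_address[key] = f"{prefix}{len(bp_by_address) + 1:06d}"
            result.append((bp_by_address[key], ""))
        else:
            result.append((bp_by_address[key], "Y"))
    return result
```

Matching on address alone means "RK Electrical" and "Robert King Electrical" land in the same group as long as their addresses normalize to the same string; names that differ and addresses that differ would still need a manual look.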
I have attached a sample sheet which deals with property sales data, in reference to a two-part question.
1. If two or more rows share the same value in column 'E', each of them needs to be identified with a 'Y' in Column G. From looking through old threads, this seems possible, though I could not find an answer I could 'bend' to work. If this is possible, can the following be included?
2. If two or more rows share the same value in column 'E', list the identifiers (value in Column A) for the others in Column H, separated by commas "," or slashes (preferred) "/". ie. "00370600000700 'NICHOLLS JOHN W & CARLA R 11/27/2000 85000 '260647 W Y '00370600000800" and "00370600000800 'NICHOLLS JOHN W & CARLA R 11/27/2000 85000 '260647 W Y '00370600000700" or "'00370500000801 'FRAHM FREDERICK/ERIK/KRYSTYNA 06/17/2004 110000 '288904W Y '00370500000802/'00370500000803"
The sample sheet attached includes 26 rows of data with several 'doubles' and one 'triple' 'duplicates'.
Please note that the 'real' file has slightly fewer than 200,000 rows, and I have seen up to 40 'duplicates' with the same value in Column 'E'.
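As a rough sketch of the logic (in Python, not an Excel formula), both parts can be handled by first collecting the Column A identifiers for each Column E value. The function name and the tuple layout are assumptions made for the example.

```python
from collections import defaultdict

def flag_duplicates(rows):
    """rows: list of (identifier, key) pairs, i.e. Column A and Column E.

    Returns one (flag, others) pair per row for Columns G and H: flag is "Y"
    when another row shares the key, and others lists those rows' identifiers
    joined with "/".
    """
    ids_by_key = defaultdict(list)
    for identifier, key in rows:
        ids_by_key[key].append(identifier)

    output = []
    for identifier, key in rows:
        others = [i for i in ids_by_key[key] if i != identifier]
        output.append(("Y" if others else "", "/".join(others)))
    return output
```

Because the identifiers are grouped in a single pass, the approach should scale to a couple of hundred thousand rows, including groups of 40 or more.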
Here is the deal: I have 4 columns. Each line gives the following information: ID is the record number, CODE_NAME is a code for each fox in the study, Date is the date of the record, and Area is a sub-area in a bigger grid. Basically, I have an area divided into squares, and every time a marked fox enters one of my squares a new line is created in the data. What I want to know is whether a fox, while in my study area, returns to the same squares or not.
ID      CODE_NAME   Date        Area   Logical
116     SSS1        02-Jan-09   1A1
273     RRR1        02-Jan-09   2A2
2959    BBB1        02-Jan-09   1B1
2959
What I'm trying to achieve is a 5th column where I'll get a logical value of TRUE or FALSE indicating, for each fox, whether in its next available record it went back to the same square or not. So if you check the fox RRR1, I have 2 records: one on 2 January in area 1A1 and a second on 7 January in the same area. The fox BBB1 was always in different areas, and CCC1 only came back to one square.
The problem is that I have over 400 foxes and 12,000 records, and I'm trying to find a way of doing it automatically.
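A minimal sketch of one reading of this, in Python: a record is TRUE when the same fox was already recorded in that square on an earlier row. That interpretation, the function name, and the `dd-Mon-yy` date format are assumptions.

```python
from datetime import datetime

def flag_returns(records):
    """records: list of (code_name, date_text, area), dates like "02-Jan-09".

    Returns a list of booleans in the original row order. A row is True when
    the same fox already appears in the same area on an earlier record.
    """
    # Process rows chronologically; sorted() is stable, so rows that share
    # a date keep their sheet order.
    order = sorted(
        range(len(records)),
        key=lambda i: datetime.strptime(records[i][1], "%d-%b-%y"),
    )
    seen = set()
    flags = [False] * len(records)
    for i in order:
        fox, _date, area = records[i]
        flags[i] = (fox, area) in seen
        seen.add((fox, area))
    return flags
```

With 12,000 records this is a single sort plus one pass, so the number of foxes should not matter much.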
How do I go about assigning a number to a particular piece of data? To give a simple example of what I'm trying to do, and what I envision, I'll pretend I'm building a sandwich.
So, someone could come along and build their sandwich: ham/turkey/tomato/mayo on white bread. Excel would then recognize that 1+3+4+7+8=23, and that 23 = The American (the name of the sandwich, which I've already assigned a variable to).
I'm using the sandwich model because it's a lot simpler than what I'm attempting to do.
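The idea can be sketched in Python as two lookups: one from ingredient to number and one from total to name. The post only gives the sum 1+3+4+7+8, so which number belongs to which ingredient is an assumption here.

```python
# Assumed ingredient numbering; the post does not say which number is which.
INGREDIENT_CODES = {"ham": 1, "turkey": 3, "tomato": 4, "mayo": 7, "white bread": 8}

# Total -> sandwich name, taken from the example in the post.
SANDWICH_NAMES = {23: "The American"}

def name_sandwich(ingredients):
    """Sum the ingredient codes and look the total up in SANDWICH_NAMES."""
    total = sum(INGREDIENT_CODES[item] for item in ingredients)
    return SANDWICH_NAMES.get(total, "Unknown")
```

One caveat with plain sums: two different combinations can add up to the same total. Numbering the items 1, 2, 4, 8, 16, and so on avoids that, because every combination of distinct powers of two has a unique sum.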
I have organized account numbers from two systems onto a spreadsheet, with numbers from System 1 arrayed in Column A and numbers from System 2 arrayed in Column B. I need to evaluate the numbers in both columns and isolate the numbers that are NOT DUPLICATES across the two systems (Columns A and B) and return a list of non-duplicate numbers in Column C. Here is what the table would look like:
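In case it helps to see the logic outside Excel, here is a small Python sketch that returns the values appearing in only one of the two columns. The function name is made up for the example, and the account numbers are treated as plain text.

```python
def non_duplicates(column_a, column_b):
    """Return values found in exactly one of the two columns.

    Order follows Column A first, then Column B; a value repeated within a
    single column is listed once.
    """
    in_a, in_b = set(column_a), set(column_b)
    only_one = in_a ^ in_b  # symmetric difference: in A or B, but not both
    combined = list(column_a) + list(column_b)
    return [v for v in dict.fromkeys(combined) if v in only_one]
```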
I would like to identify duplicates in a list using conditional formatting in Excel 2007.
I have tried choosing to identify duplicates using the formula that I have found on many threads throughout the message board:
=COUNTIF($A$1:$A1,$A1)>1.
This function works up to 15 characters in a cell, but Excel seems to be treating all digits after the first 15 as the same, resulting in a "fuzzy match" where I want an exact match. Many of the values in my list are 18 characters long, in text format to prevent rounding.
I've noticed that Excel treats the 18-character values the same way when sorting; for example, it treats these two values as the same:
'234567891011121314 '234567891011122413
Is there a way to force Excel to examine those last four digits for the purpose of sorting & identifying duplicates?
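For comparison, this is what an exact, text-only version of the `COUNTIF($A$1:$A1,$A1)>1` test looks like in Python. It is only a sketch of the intended behaviour (every character is compared, nothing is converted to a number); the function name is an assumption.

```python
def flag_repeats(values):
    """Return True for each value that already appeared earlier in the list.

    Values are compared as text, so all 18 characters take part in the match.
    """
    seen = set()
    flags = []
    for value in values:
        text = str(value).strip()
        flags.append(text in seen)
        seen.add(text)
    return flags
```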
I am trying to collect data from a server. The data comes through as .csv (separated data), and I am able to get all the useless info/columns out of the way, but I would like to keep a record of how many times these "alarms" come in. Could someone help me form a spreadsheet, or tell me how to go about using a tool to simplify my process?
I am using a vlookup and have a problem. I am assigning a category to an item number based on the first two characters of the item number. For example item number 60123 would equal scrap because of the first two characters of 60. But the item number can begin with either a number or letter. Here is the formula I am using that works for item numbers that begin with numbers:
=VLOOKUP(VALUE(LEFT(E2,2)),Sheet3!A:B,2,FALSE)
It works fine until I reach an item number that begins with a letter; then I get the dreaded #VALUE! error. If I take VALUE out of the formula, it works for the item numbers that begin with letters but not for the ones that begin with numbers.
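The underlying issue is a type mismatch between the lookup key and the table keys. A sketch of the same lookup in Python, where every prefix is kept as text so numeric and letter prefixes behave alike; the "AB" entry and the "Unknown" default are hypothetical.

```python
# Prefix -> category. "60" -> "scrap" comes from the post; "AB" is made up.
CATEGORY_BY_PREFIX = {"60": "scrap", "AB": "raw material"}

def categorize(item_number, table=CATEGORY_BY_PREFIX, default="Unknown"):
    """Look up the category from the first two characters, always as text."""
    prefix = str(item_number).strip()[:2].upper()
    return table.get(prefix, default)
```

The design choice is simply to store both sides as text instead of converting the prefix to a number.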
I've collected some data from a GPS logger regarding the speed of an athlete. I want to calculate how many sprints this particular athlete undertook during a training session.
Sprinting is defined as a speed of > 20 kph.
One sprint would be the attainment of one peak >20 kph before decreasing below 20 kph.
It's easy to identify the 3 peaks and thus sprints from the xy scatterplot in the attached file, but I'm struggling to find a way to calculate this.
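One way to count this, sketched in Python, is to count the upward crossings of the 20 kph threshold: each time the speed goes from at or below 20 to above 20, a new sprint starts. The function name and the sample speeds in the usage note are assumptions.

```python
def count_sprints(speeds_kph, threshold=20.0):
    """Count runs of consecutive readings strictly above the threshold."""
    sprints = 0
    above = False
    for speed in speeds_kph:
        if speed > threshold and not above:
            sprints += 1  # rising edge: a new sprint begins here
        above = speed > threshold
    return sprints
```

For example, `count_sprints([5, 12, 21, 25, 18, 10, 22, 19, 23, 24, 15])` gives 3. Note that a session ending while still above 20 kph is counted as a sprint here, even though the speed never dropped back below the threshold.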
In column H I have a list of numbers separated by spaces; the number of lines can change. In column L I have a list of numbers which can grow or shrink.
I would like to check each cell in column H and if any numbers are not listed in column L then it/they should be shown in column G.
Example 1: H2 shows 6 11, therefore cell G2 should show 11.
Example 2: H6 shows 5 6 9 11, therefore G6 should show 9 11.
Sheet1 (columns I to K are empty)
Row   H            L
1     Container    ID
2     6 11         1
3     5 8 11       2
4     5 7 11       3
5     5 7          5
6     5 6 9 11     6
7     5 6 9
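The check itself is a split followed by a membership test. A Python sketch, where the function name is made up and the column L values in the usage note (1, 2, 3, 5, 6) are my reading of the sample sheet:

```python
def missing_numbers(cell_text, allowed):
    """Return the space-separated numbers in cell_text that are not in allowed.

    cell_text: contents of a column H cell, e.g. "5 6 9 11".
    allowed:   the values currently listed in column L.
    """
    allowed_set = {str(value).strip() for value in allowed}
    return " ".join(tok for tok in cell_text.split() if tok not in allowed_set)
```

With column L holding 1, 2, 3, 5, 6, the call `missing_numbers("6 11", [1, 2, 3, 5, 6])` returns "11" and `missing_numbers("5 6 9 11", [1, 2, 3, 5, 6])` returns "9 11", matching the two examples.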
I am looking for the easiest way to find duplicate Work Order numbers that exist in 2 separate workbooks. Example: Workbook 1, Sheet one contains the numbers 1-100 in A1:A100. Workbook 2, Sheet one contains X amount of the numbers between 1-100, located somewhere in A:A. For argument's sake, let's assume those numbers are 3, 6, 33, 87, 99. What would the formula be to return the values that are in both of the workbooks?
I have a small table I would like to create. It can be done manually (but that would be very time consuming), but I'm sure there is a way using IFs and VLOOKUPs so that the data selection can be done automatically.
So in column 1 I have various valuations, from 0 to anything over 50 million, that I need to separate into 4 different columns based on their size: column A would have 0 - 250k, column B 251k to 500k, column C 501k to 1 million, and so on.
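A sketch of the banding logic in Python, using a sorted list of upper bounds. The first three bands come from the post; the label and cut-off of the fourth band ("over 1 million") are assumptions, since the post only says "etc".

```python
from bisect import bisect_left

# Inclusive upper bounds of the first three bands.
UPPER_BOUNDS = [250_000, 500_000, 1_000_000]
# The fourth label is assumed; the post does not spell it out.
LABELS = ["0 - 250k", "251k - 500k", "501k - 1 million", "over 1 million"]

def band(valuation):
    """Return the label of the band a valuation falls into."""
    return LABELS[bisect_left(UPPER_BOUNDS, valuation)]
```

In Excel terms this is the same idea as an approximate-match lookup against a small table of band boundaries, which avoids a chain of nested IFs.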
I am trying to slim down my database results in Excel via MS Query by searching for Part IDs that are numeric (we have parts that also contain letters, and I want to weed those out).
In all my searching on the web, I thought the ISNUMERIC() function should be the function for this, but I keep getting an ORA-00904: "ISNUMERIC": invalid identifier error.
Is this function supposed to work, or is there another function that will do this?
This is my SQL statement so far, which works to get parts that are 6 characters long only:
SELECT PART.ID FROM SYSADM.PART PART WHERE (LENGTH(PART.ID)=6)
When I change it to this to get parts that are numeric, it gives the error above:
SELECT PART.ID FROM SYSADM.PART PART WHERE (LENGTH(PART.ID)=6) AND (ISNUMERIC(PART.ID)=1)
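ISNUMERIC is a SQL Server function, which would explain why an Oracle database reports it as an invalid identifier. If filtering in the query proves awkward, one fallback is to fetch the six-character IDs with the working statement and filter afterwards. A minimal Python sketch of that post-filter, with a made-up function name:

```python
def numeric_part_ids(part_ids, length=6):
    """Keep only IDs of the given length that consist solely of digits 0-9."""
    return [
        part_id
        for part_id in part_ids
        if len(part_id) == length and part_id.isascii() and part_id.isdigit()
    ]
```

On the database side, Oracle's `REGEXP_LIKE` condition with a digits-only pattern is the usual place to look for an in-query equivalent.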
I am trying to work out how to use the RANK formula to rank numbers in column B and keep them in step with column A.
So column A has, say, five 1's, with column B holding different scores; continuing below the 1's in A are the 2's, and so on. Is there a way to continue the ranking formula without manually changing the cell ranges?
So =RANK(B1,$B$1:$B$7,1), but can I do that where A=1, and then A=2, etc.? So if A=1, RANK(B1,$B$1:$B$7,1).
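The per-group rank can be described as "1 plus the number of rows in the same group with a smaller score", which matches ascending RANK, ties included. A Python sketch of that rule; the function name is an assumption.

```python
def rank_within_groups(groups, scores):
    """Ascending rank of each score among rows sharing the same group value.

    Ties share a rank, as with Excel's RANK(..., 1).
    """
    pairs = list(zip(groups, scores))
    ranks = []
    for group, score in pairs:
        smaller = sum(1 for g, s in pairs if g == group and s < score)
        ranks.append(smaller + 1)
    return ranks
```

For example, groups `[1, 1, 1, 2, 2]` with scores `[30, 10, 20, 5, 5]` give ranks `[3, 1, 2, 1, 1]`. Counting with two conditions is the same idea a COUNTIFS-style formula would use, so no cell ranges need adjusting per group.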
As you can see the Date and Time are repeated for several rows. This is how the data I receive comes through as A,B and C refer to a single transaction and D & E refer to another transaction.
I am looking for a way for Excel to compare the date and time of each row, look for matching rows above and below it, and then fill in a column next to it indicating that x number of rows are linked to a single transaction, preferably labelling them in some order so I can tell how many transactions there are.
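A sketch of the labelling logic in Python: walk down the rows and start a new transaction number whenever the date/time differs from the row above. It assumes rows belonging to one transaction are adjacent, as described; the function name is made up.

```python
def label_transactions(timestamps):
    """Give consecutive rows with the same date/time the same number.

    Returns one label per row: 1 for the first transaction, 2 for the next,
    and so on. The last label is the total number of transactions.
    """
    labels = []
    current = 0
    previous = object()  # sentinel that never equals a real timestamp
    for stamp in timestamps:
        if stamp != previous:
            current += 1
            previous = stamp
        labels.append(current)
    return labels
```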
I'm new to macros. I'm trying to find a string of text and assign that to be the top of the data, then find a different string and assign that to be the bottom of the data, and then run a loop inside that data range. Am I going about it the right way? Attached is a sample data file.
492 500 773 738 572 492 When a number is repeated, I need the first one (492) to be formatted with a color, and this should continue like this: 200 572 (format the first 572), 492 (format the next 492), but the actual number stays without formatting!
Essentially, the formula/solution will "know" that the third occurrence in the list is actually the 3rd occurrence, and so forth. I tried COUNTIF, but that just gave me the total number of occurrences.
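What is being asked for is a running count instead of a total count. A small Python sketch of the difference (the function name is an assumption):

```python
from collections import Counter

def running_occurrence(values):
    """For each item, return how many times it has appeared so far."""
    counts = Counter()
    result = []
    for value in values:
        counts[value] += 1
        result.append(counts[value])
    return result
```

The spreadsheet analogue is a COUNTIF whose range starts at a fixed first cell and ends at the current row, such as `COUNTIF($A$1:$A1,$A1)`, so the count only looks at the rows above and including the current one.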
I am trying to improve my expense report template and need to check, on load, whether the expense report number has been entered correctly and whether the report has previously been loaded. The expense report number format looks like this: AAA-BBBBBB-CC, where:
AAA – Employee ID #
BBBBBB – End of the week date
CC – Weekly expense report number
For instance, 023-122008-01 means: employee number 023, week ending date 12/20/2008, weekly expense report number 01. I would like to prevent (or warn about) loading an incorrectly formatted expense report number, check for possible duplicates, and check that the expense report being loaded belongs to the right person (by simply matching the employee ID previously loaded in a different cell of the same sheet against the first three digits of the just-loaded expense report number). I think I know how to do each of these separately, but have no idea how to combine all three checks for one cell.
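The three checks can be combined by running them in sequence and collecting whatever fails. A Python sketch under stated assumptions: the date part is read as MMDDYY (based on 122008 meaning 12/20/2008), and the function name and message texts are made up.

```python
import re
from datetime import datetime

# Three digits, six digits, two digits, separated by hyphens.
PATTERN = re.compile(r"(\d{3})-(\d{6})-(\d{2})")

def check_report_number(number, employee_id, already_loaded):
    """Return a list of problems; an empty list means all three checks pass."""
    match = PATTERN.fullmatch(number)
    if not match:
        return ["bad format"]

    problems = []
    employee_part, date_part, _sequence = match.groups()
    try:
        datetime.strptime(date_part, "%m%d%y")  # assumed MMDDYY layout
    except ValueError:
        problems.append("invalid date")
    if employee_part != employee_id:
        problems.append("wrong employee")
    if number in already_loaded:
        problems.append("duplicate")
    return problems
```

In a sheet, the equivalent would be a single validation rule that ANDs the three conditions together, with each condition written the way you already know how to write it separately.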
My sheet contains a variable number of duplicates (2 up to 12 duplicates) that can each have different amounts.
Ex: 9879 = 7 9879 = 0
I would like to be able to highlight all the duplicates whose values come to 0 (all the duplicates together must have a total value of 0). I've tried nested IF functions and conditional formatting, but to no avail.
I've attached a small file that shows the end result.
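Reading the requirement as "highlight every row of a duplicated number when that number's amounts add up to 0", the logic can be sketched in Python as below. That reading, the function name, and the sample amounts in the usage note are assumptions.

```python
from collections import defaultdict

def zero_sum_duplicates(rows):
    """rows: list of (number, amount) pairs.

    Returns one boolean per row: True when the number occurs more than once
    and all of its amounts total zero.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for number, amount in rows:
        totals[number] += amount
        counts[number] += 1
    return [
        counts[number] > 1 and abs(totals[number]) < 1e-9
        for number, _amount in rows
    ]
```

For example, rows `(9879, 7)` and `(9879, -7)` would both be highlighted, while `(1234, 5)` and `(1234, 0)` would not, because their total is 5.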
My sheet contains a variable number of duplicates (2 up to 14 duplicates) that can each have different amounts.
Ex: 9879 = 7 9879 = 0
I would like to be able to highlight all the duplicates whose values come to 0 (all the duplicates together must have a total value of 0). I've tried nested IF functions and conditional formatting, but to no avail.
I've attached a small file that shows the end result. Attachment 241407
I have a macro that, when run, copies and pastes the values for a row of cells onto another sheet. However, if 2 cells in the same row have a number greater than zero, it duplicates the entry. What I need is some sort of check that says: if 2 or more cells in the same row have a number greater than zero, copy that row once only.
Below is a range of data I am working with. I am trying to create a formula that will count the number of different entries in column A for each different entry in column B, i.e. how many different values there are for "Packing".
008003   PICKING MISTAKE
008042   UNFIT FOR PURPOSE
008035   PACKING
008035   PACKING
008035   PACKING
007960   CHANGE OF MIND
007986   PACKING
007986   PACKING
008050   UNFIT FOR PURPOSE
008070   CHANGE OF MIND
008070   CHANGE OF MIND
008074   CHANGE OF MIND
008074   CHANGE OF MIND
008074   CHANGE OF MIND
008074   CHANGE OF MIND
008074   CHANGE OF MIND
008086   PACKING
008085   PACKING
008085   PACKING
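To show the counting rule outside of a formula, here is a Python sketch that collects the distinct column A values per column B value. The function name is an assumption.

```python
from collections import defaultdict

def distinct_counts(rows):
    """rows: list of (code, reason) pairs, i.e. column A and column B.

    Returns {reason: number of different codes seen with that reason}.
    """
    codes_by_reason = defaultdict(set)
    for code, reason in rows:
        codes_by_reason[reason].add(code)
    return {reason: len(codes) for reason, codes in codes_by_reason.items()}
```

On the data above, "PACKING" appears with the codes 008035, 007986, 008086, and 008085, so its distinct count is 4 even though it occurs on eight rows.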
I need the easiest way to randomize or generate team numbers in a league. Using COL A assign numbers between 1 & 8 without duplicates. Then repeat 5 more times. This has to be done on the spot and has time constraints as the players will be waiting for their team assignments.
EXAMPLE: I have 48 players who will be assigned to 8 teams of 6. I want to randomize the drawing so the same players don't play on the same teams each week. I would also like to be able to adjust the number of teams (6, 8, or 10) depending on how many players are present.
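A quick way to think about the draw: shuffle the player list, then deal the shuffled players out to the teams in turn, like dealing cards. A Python sketch of that idea; the function name and the optional `rng` parameter are assumptions.

```python
import random

def draw_teams(players, team_count, rng=None):
    """Shuffle the players and deal them round-robin into team_count teams.

    Returns {player: team_number}, with team numbers starting at 1. Team
    sizes differ by at most one when the players do not divide evenly.
    """
    rng = rng or random.Random()
    shuffled = list(players)
    rng.shuffle(shuffled)
    return {player: i % team_count + 1 for i, player in enumerate(shuffled)}
```

With 48 players and `team_count=8` every team gets exactly 6 players, and changing `team_count` to 6 or 10 handles the other turnouts. The spreadsheet equivalent would be a column of random numbers next to the names, sorted, and then numbered 1 to 8 repeatedly down the list.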