29694f4efestartup fixes
Matt Wells
2014-03-08 10:25:56 -07:00
8aa0662a27Merge branch 'diffbot' into testing
Matt Wells
2014-03-08 09:38:44 -07:00
14817df7a9new site patterns api stuff
Matt Wells
2014-03-08 09:23:32 -07:00
7cdd411ef1Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-07 09:26:47 -08:00
72fab5b61eDo not end a crawl while urls are still being spidered because they might add more links to spiderdb when they finally complete.
Matt Wells
2014-03-07 09:30:12 -08:00
dcd42e455eMerge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-07 09:02:29 -08:00
c143ee1fbafix core when creating a new collection because we incremented m_numRecs but did not grow the ptr buffer. also added support for localgb.conf so we can use that instead of gb.conf to avoid git push/pull conflicts.
Matt Wells
2014-03-07 09:05:14 -08:00
f777e6cccdMerge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-07 08:23:21 -08:00
d6177019ecminor fix
Matt Wells
2014-03-07 08:07:09 -08:00
434dd182d4fix mem leak. always harvest links for custom crawls.
Matt Wells
2014-03-06 21:24:39 -08:00
e351d2a6f1get searching on token working
Matt Wells
2014-03-06 17:01:41 -08:00
27e8e810d2use collnum instead of coll string. more stable since resetting collections keeps string the same but changes the collnum.
Matt Wells
2014-03-06 15:48:11 -08:00
d74f748e93search all collections under a token if "&token" is given but not "&c=..."
Matt Wells
2014-03-06 11:00:43 -08:00
97e46dbf4eMerge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-03-06 10:45:45 -08:00
ca2d307229revert gb.conf
Matt Wells
2014-03-06 10:47:03 -08:00
efa92b16fdMerge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-03-06 10:45:35 -08:00
25cf0efdbffirst compiled stab at multi collection searching.
Matt Wells
2014-03-06 10:45:13 -08:00
451a092378fix core from changing parms while evaluating a url.
Matt Wells
2014-03-06 07:47:43 -08:00
0962e243a4Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-05 07:43:25 -08:00
58a1feeea5specify &header=1 explicitly to get json serp header lest we break our clients' parsers
Matt Wells
2014-03-05 07:41:59 -08:00
13e33bc261fix jezebel crawl from hanging.
Matt Wells
2014-03-04 19:45:26 -08:00
1b62f1582bprint memtable when almost full so we can see where the leak is. more spiders for ethan. do not try to get diffbot reply if page is already json. likely it is an injected diffbot json reply.
Matt Wells
2014-03-04 18:19:50 -08:00
603cd67758fix csv downloads some more
Matt Wells
2014-03-04 12:07:46 -08:00
2ab9aaeeaastreaming csv fixes
Matt Wells
2014-03-04 11:04:26 -08:00
866b09d25eMerge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-03-04 10:46:28 -08:00
b1381cc610make csv streamable, faster and take almost no memory.
Matt Wells
2014-03-04 10:45:57 -08:00
280dcb85cffix for passing testUpdatedContent smoketest
Matt Wells
2014-03-04 09:09:51 -08:00
ab9f2b33c1definition updates
Matt Wells
2014-03-04 08:37:39 -07:00
1acb16b1eetweak empty doledb priority logic. anchor it more to m_doleIpTable for more reliability. seems like it was causing some slowdowns during spidering. seems more continuous now.
Matt Wells
2014-03-03 13:48:59 -08:00
48b5330d9conly skip checking to spider a url if its doleip table is empty
Matt Wells
2014-03-03 13:22:27 -08:00
282dad6cefdeal with no coll recs when getting link text using msg25. do not share g_lineTable between collections.
Matt Wells
2014-03-03 08:04:24 -08:00
ff8a0b4ef1do not let all collections share the same line table in linkdb.cpp
Matt Wells
2014-03-03 07:50:11 -08:00
a82abe8260added ^ operator to url crawl patterns. good for tmz crawl.
Matt Wells
2014-03-02 14:57:59 -08:00
7fd6bbd7f5added ^ support to url crawl expressions
Matt Wells
2014-03-02 14:41:25 -08:00
e4d425c18ffix coll being deleted when getting link text.
Matt Wells
2014-03-02 14:24:49 -08:00
bb5016e88badd the following fields to json search results: currentTimeUTC, responseTimeMS, docsInCollection, hits, moreResultsFollow, and docId. Changes structure of json so that now the results array is returned as an array within a dictionary (field name "results") as opposed to being the only object returned
Daniel Steinberg
2014-03-01 11:16:17 -08:00
aeb2833d20Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-28 11:46:44 -08:00
11efab9862Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-28 08:23:59 -08:00
c596d38e60fix core from getting title of json object
Matt Wells
2014-02-28 08:18:09 -08:00
5f3aa24805took out restrictDomain logic. now we always only follow links on the same domain as the seed UNLESS a url crawl pattern or a url crawl regex was specified.
Matt Wells
2014-02-27 19:53:17 -08:00
42f254125efix core in new link text logic. empty msg25 replies are ok if g_errno is set.
Matt Wells
2014-02-27 13:56:32 -08:00
365fc16606fix core in "wait in line" logic when getting link info in Linkdb.cpp.
Matt Wells
2014-02-27 09:22:35 -08:00
af9eb8fb73need to allow clients to not restrict to seed domains.
Matt Wells
2014-02-26 22:27:22 -08:00
927f4626eeMerge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-26 22:26:13 -08:00
eaca38cbfdfix new result streaming logic some more
Matt Wells
2014-02-26 21:42:43 -08:00
0933884191fix super fast and mem efficient search results streaming code.
Matt Wells
2014-02-26 21:18:08 -08:00
f11e25024aMerge branch 'diffbot' into diffbot-testing
Matt Wells
2014-02-26 20:34:06 -08:00
1030e6ada8Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-26 20:30:20 -08:00
b429f12346add logic to save memory when streaming over 200 results back. should fix oom when streaming back hundreds of thousands of results.
Matt Wells
2014-02-26 20:33:35 -08:00
8208178c79remove "Initial crawl request" dups from the urls.csv. do not count fake firstip spider request attempts in xmldoc.cpp as crawlbot page download attempts since we just re-add that request with the correct firstip and bail. it basically doubles this count from what users would expect.
Matt Wells
2014-02-26 15:48:52 -08:00
b450bfc2a6do not show html column in csv. libreoffice and excel flub it if a cell is over 32k or so.
Matt Wells
2014-02-26 15:03:05 -08:00
a0697e1bb5do not allow custom crawls to spider the web any more.
Matt Wells
2014-02-26 10:26:09 -08:00
6445b0572bfix gb.conf
Matt Wells
2014-02-26 01:08:20 -08:00
a6b7e088f5take out tfndb, unused. fix core from diffbot url too long.
Matt Wells
2014-02-26 01:07:13 -08:00
6716d8f21bremove entry from linetable for linkinfo lookup
Matt Wells
2014-02-26 00:27:29 -08:00
8bb5d106dbfixes for query reindex/delete.
Matt Wells
2014-02-25 18:12:45 -08:00
33c8123288more fixes for new link info code.
Matt Wells
2014-02-25 13:53:41 -08:00
9c486c77edMerge branch 'diffbot' into diffbot-testing
Matt Wells
2014-02-25 12:32:40 -08:00
cf6695f625speed up getNumTotalRecs() by caching it basically for 2 seconds since pingserver.cpp calls it all the time.
Matt Wells
2014-02-25 12:14:51 -08:00
b3ff7df904Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-02-25 11:05:46 -08:00
b58d88c57ffix sections infinite loop bug.
Matt Wells
2014-02-25 11:09:07 -08:00
94a55bf9a6fixes for new link info code so it doesn't bottleneck. got EFENCE_SIZE working so we can use efence on large allocs only so we don't go oom using it. might help finding some of the out of bounds writing going on.
Matt Wells
2014-02-25 10:55:05 -08:00
ceb623bb8fdo not dedup bulks. only respider urls if error is tmp. mess with msg1 in spider.cpp so niceness is MAX_NICENESS and not 0 because it was not able to trigger a doledb dump.
Matt Wells
2014-02-23 20:04:46 -08:00
72f1312652new linkdb code compiling.
Matt Wells
2014-02-20 17:27:28 -08:00
9820f14066checkpoint
Matt Wells
2014-02-20 14:54:21 -08:00
88dfa20cbedocid based spider rec related fixes.
Matt Wells
2014-02-20 08:46:00 -08:00
e87b71caeffix query reindex core
Matt Wells
2014-02-19 21:07:01 -08:00
b37b19ea4aprint comma before json item so we do not end in trailing comma ever
Matt Wells
2014-02-19 10:04:49 -08:00
dda7648333try to fix problem of crawls stopping when they shouldn't. seems like it might be doing the trick.
Matt Wells
2014-02-19 00:51:46 -08:00
b48adc0542try to fix crawls stopping too early
Matt Wells
2014-02-18 10:28:48 -08:00
ae2aed7066try to fix a few cores from deleting collections. try to spider urls again if user changes certain crawling parms. like regex, patterns, etc.
Matt Wells
2014-02-18 09:44:15 -08:00
117f0ca4e8more bio updates
Matt Wells
2014-02-17 15:32:45 -07:00
bc2b9d6179bio.html updates
Matt Wells
2014-02-17 15:29:08 -07:00
f942183104ignore maxtocrawl for bulk jobs too
Matt Wells
2014-02-16 22:24:17 -08:00
a4deb7ff08exempt bulk jobs from maxtoprocess
Matt Wells
2014-02-16 22:14:43 -08:00
9c9d5fff98print out content type in caps with maroon bg in serps. use empty site patterns to mean no restriction, not "*" anymore for simplicity.
Matt Wells
2014-02-16 22:47:02 -07:00
0b5cd6d3f9more parm fixes
Matt Wells
2014-02-16 22:18:39 -07:00
48315f6dc3parm fixes
Matt Wells
2014-02-16 22:13:27 -07:00
725b6189a7show user's ip in master ips description so they can add it to the list easily.
Matt Wells
2014-02-16 21:56:31 -07:00
9d0dca71dbfix rapid coll delete bug some more.
Matt Wells
2014-02-16 20:13:06 -08:00
def7822a22multiple red boxes for clarity
Matt Wells
2014-02-16 20:33:51 -07:00
ce652462b0add color coded circles to coll nav bar. disk usage red box.
Matt Wells
2014-02-16 19:59:53 -07:00
88a151f1d9Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-16 16:02:00 -08:00
c691b2dd5fhopcount precedence fix
Matt Wells
2014-02-16 16:01:29 -08:00
a4b6716623Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-16 15:57:46 -08:00
29c9e49935do not do link loop detection if doing a custom crawl.
Matt Wells
2014-02-16 15:57:32 -08:00
f8135e628efall back to hop count if priority is tied (and both are due to be spidered). defaults back to breadth first like it was doing before.
Matt Wells
2014-02-16 15:52:08 -08:00
0a4963f597do not allow spot/seeds to be added to collnum being repaired or rebuilt.
Matt Wells
2014-02-16 15:18:50 -08:00
fe63371622Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-16 13:39:02 -08:00
4930243de3minor updates
Matt Wells
2014-02-16 13:38:54 -08:00
fe0f2d3537allow coll delete if not the one being repaired
Matt Wells
2014-02-16 10:55:34 -08:00
32526a9b25more checksum fixes for json. fixes for repair/rebuild procedure.
Matt Wells
2014-02-16 10:46:41 -08:00
df59d3946afix content hash issues for json. do not hash url/resolved_url/html fields. do exact order-independent hashes of remaining field/value pairs. used for setting EDOCUNCHANGED and doing spidertime/querytime deduping. also do not index "html" json field because it is huge, slow and redundant. convert "date" field into a number so we can sort/constrain by article pub date.
Matt Wells
2014-02-15 14:40:56 -08:00
734ce1fc55fix core from a high priority injection inserting records at the same time as a lower priority spider.
Matt Wells
2014-02-14 10:51:02 -08:00
3271f22995Merge branch 'diffbot-testing' into diffbot
Matt Wells
2014-02-13 11:25:59 -08:00
dc8b9090e8fix out of alloc slots core
Matt Wells
2014-02-13 11:21:39 -08:00
08b103f3a4Merge branch 'diffbot-testing' into diffbot
Matt Wells
2014-02-13 10:11:56 -08:00
c3d8a143befix bug of process regex being ignored when crawl regex was specified.
Matt Wells
2014-02-13 10:06:14 -08:00
4eee547391do not do fuzzy deduping if &icc=1 (include cached copy) is true for search results.
Matt Wells
2014-02-13 08:51:03 -08:00
cd6069e5a6send single space to socket if not streaming and search results still not ready after 10 seconds. send it every 10 seconds to prevent client from closing socket. sped up all downloads, json and csv, by not doing "fuzzy" deduping of search results, but just deduping on page content hash. added TcpSocket::m_numDestroys to ensure we do not send heartbeat on a socket that was closed and re-opened for another client.
Matt Wells
2014-02-13 08:45:13 -08:00